Introduction
Snapshot of the Parallel tools world

■ Today’s architectures around us:
□ Multicore architectures

□ Distributed architectures 

□ Many Core architectures 


■ OK, but your single core is calling to remind you:
□ I have a SIMD instruction set!


What’s SIMD? 
Principles

■ Single Instruction, Multiple Data

■ Operations applied to N elements of type T within a single register

[Figure legend: Instructions, Data, Results]

■ Up to N times faster than the regular ALU/FPU
Why is SIMD abstraction needed?


x86 family 

■ MMX 64-bit int8, int16, int32

■ SSE 128-bit float 

■ SSE2 128-bit int8, int16, int32, int64, double

■ SSE3 

■ SSSE3 

■ SSE4a (AMD only) 

■ SSE4.1 

■ SSE4.2 

■ AVX 256-bit float, double 

■ FMA4 (AMD only) 

■ XOP (AMD only) 

■ FMA3 


PowerPC family 

■ AltiVec 128-bit int8, int16, int32, int64, float

■ Cell SPU 128-bit int8, int16, int32, int64, float, double

ARM family 

■ VFP 64-bit float, double 

■ NEON 64-bit and 128-bit double, float, int8, int16, int32, int64
Why is SIMD abstraction needed?
x86 family

■ New Intel microarchitecture: Haswell, with AVX2

□ 256-bit SIMD unit

□ Customized integer intrinsics

□ scatter/gather

■ New Intel microarchitecture: MIC

□ 512-bit SIMD unit


PowerPC family 

■ AltiVec VMX128 in the Xbox 360

■ POWER7 AltiVec with int64 and double

ARM family 

■ NEON2 NEON + double 
Why not let the compiler do it?
Compilers are only so smart

■ Automatic vectorization can only happen if:

□ Memory is well laid out

□ Code is inherently vectorizable

■ Calls to separately compiled functions are not vectorized

■ Compilers don’t always have enough static information to know what they can vectorize
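
One common case of missing static information is pointer aliasing. The sketch below is an illustration only (the function names are hypothetical and not taken from the talk): the same loop gives a different result when its input and output overlap, so a compiler cannot process several elements at once unless it can prove, or check at run time, that they do not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: out[i] = in[i] + 1 for i in [0, n).
// If out and in may overlap, iteration i can read a value written by
// iteration i - 1, so the iterations are not independent.
void add_one(float* out, const float* in, std::size_t n)
{
  for (std::size_t i = 0; i != n; ++i)
    out[i] = in[i] + 1.0f;
}

// Disjoint buffers: every output element is in[i] + 1.
std::vector<float> run_disjoint()
{
  std::vector<float> in(4, 0.0f), out(4, 0.0f);
  add_one(out.data(), in.data(), in.size());
  return out;
}

// Overlapping buffers: the output starts one element after the input,
// so each write feeds the next read and the values keep growing.
std::vector<float> run_overlapping()
{
  std::vector<float> buf(5, 0.0f);
  add_one(buf.data() + 1, buf.data(), 4);
  return buf;
}
```

A version that loaded four inputs before storing anything would leave 1 in every output slot of the overlapping case, which is why the compiler has to stay conservative here.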


Conclusion 

■ Explicit SIMD parallelism is still the most reliable way to get your code vectorized

■ But supporting each new extension without losing performance carries a high maintenance cost
High Level Abstraction

■ Embedded Domain Specific Language (EDSL)

□ SIMD register abstraction as data pack

□ Working at the expression level opens up wide optimization opportunities

STL integration 

■ Writing generic C++ code 

■ Reuse Container - Iterator - Algorithm abstraction 

■ Blend with modern C++ Concepts (Range...) 

How can we introduce SIMD here?

■ Keep an STL-based interface.

■ Write SIMD code like scalar code. 


What are we proposing? 


Talk Layout 


Interface 
Writing it by hand
Doing a * b + c with vectors of 32-bit integers: SSE4.1

__m128i a, b, c, result;

// Multiply first, then add: (a * b) + c.
result = _mm_add_epi32(_mm_mullo_epi32(a, b), c);


Doing a * b + c with vectors of 32-bit integers: Altivec

__vector int a, b, c, result;

// No integer multiply-add here: convert to float, fuse, convert back.
result = vec_cts(vec_madd(vec_ctf(a, 0),
                          vec_ctf(b, 0),
                          vec_ctf(c, 0)),
                 0);
The pack abstraction
simd::pack<T>

pack<T, N>: SIMD register that packs N elements of type T
pack<T>: automatically finds the best N available

Behaves just like T, except that operations yield a pack of T and not a T.


Constraints

■ T must be a fundamental arithmetic type, i.e. (un)signed char, (unsigned) short, (unsigned) int, (unsigned) long, (unsigned) long long, float or double.

□ bool support is provided through logical<T>.

■ N must be a power of 2. 
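
To make the idea of "behaves just like T" concrete, here is a minimal portable stand-in. It is a sketch only: `toy_pack` is a hypothetical name, it stores its lanes in a `std::array` and loops over them, whereas the real pack wraps a hardware register. It assumes nothing about the Boost.SIMD implementation beyond what the bullets above state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for a pack of N elements of type T.
// This is NOT the Boost.SIMD implementation.
template <class T, std::size_t N>
struct toy_pack
{
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");
  std::array<T, N> lanes;

  T operator[](std::size_t i) const { return lanes[i]; }
};

// pack + pack: the same operation applied to every lane.
// The result is cast back to T, so no promotion leaks out.
template <class T, std::size_t N>
toy_pack<T, N> operator+(toy_pack<T, N> const& a, toy_pack<T, N> const& b)
{
  toy_pack<T, N> r{};
  for (std::size_t i = 0; i != N; ++i)
    r.lanes[i] = static_cast<T>(a.lanes[i] + b.lanes[i]);
  return r;
}

// pack * pack, lane by lane.
template <class T, std::size_t N>
toy_pack<T, N> operator*(toy_pack<T, N> const& a, toy_pack<T, N> const& b)
{
  toy_pack<T, N> r{};
  for (std::size_t i = 0; i != N; ++i)
    r.lanes[i] = static_cast<T>(a.lanes[i] * b.lanes[i]);
  return r;
}

// pack + T: the scalar is applied to every lane.
template <class T, std::size_t N>
toy_pack<T, N> operator+(toy_pack<T, N> const& a, T b)
{
  toy_pack<T, N> r{};
  for (std::size_t i = 0; i != N; ++i)
    r.lanes[i] = static_cast<T>(a.lanes[i] + b);
  return r;
}
```

Code written against such a type reads like scalar code, which is the property the pack abstraction aims for.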
pack API
Operators

■ All overloadable operators are available:

pack<T> ⊕ pack<T>, pack<T> ⊕ T, T ⊕ pack<T>

■ Type coercion and promotion are disabled

uint8_t(255) + uint8_t(1) yields uint8_t(0), not int(256)

Comparisons 

■ ==, !=, <, <=, > and >= perform lane-wise comparisons and return SIMD vectors.

■ compare_eq, compare_neq, ... as functions return the result of the lexicographical comparison between the inputs as a bool.
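
The difference between the two families can be sketched with plain arrays. This is an illustration under assumptions: `lanewise_less` and `whole_less` are hypothetical names, and the arrays stand in for packs.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

using lanes4 = std::array<int, 4>;
using mask4  = std::array<bool, 4>;

// Lane-wise comparison: one boolean per lane,
// analogous to operator< applied to two packs.
mask4 lanewise_less(lanes4 const& a, lanes4 const& b)
{
  mask4 m{};
  for (std::size_t i = 0; i != a.size(); ++i)
    m[i] = a[i] < b[i];
  return m;
}

// Whole-value comparison: a single bool obtained from a
// lexicographical comparison of the lanes.
bool whole_less(lanes4 const& a, lanes4 const& b)
{
  return std::lexicographical_compare(a.begin(), a.end(),
                                      b.begin(), b.end());
}
```

The lane-wise form is what feeds operations such as select; the single-bool form is what a plain `if` needs.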

Other properties 

■ Models both a RandomAccessFusionSequence and RandomAccessRange 

■ at_c<i>(p) or p[i] can be used to access the i-th element
pack API
Memory access

■ Memory must be aligned on sizeof(T)*N to load/store a pack<T, N> from/to a T*.

■ Alignment errors trigger assertions in debug mode.
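
As a rough illustration of the alignment rule, the check below tests whether an address is a multiple of sizeof(T)*N bytes. The helper `is_aligned_to` and the `aligned_buffer` type are hypothetical names chosen for this sketch, with T = float and N = 4 assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p is a multiple of `alignment` bytes.
inline bool is_aligned_to(void const* p, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Eight floats whose first element is aligned for a 4-float register.
struct aligned_buffer
{
  alignas(sizeof(float) * 4) float data[8];
};

// The start of the buffer satisfies the sizeof(T)*N rule...
bool aligned_start()
{
  aligned_buffer b{};
  return is_aligned_to(b.data, sizeof(float) * 4);
}

// ...but the next element, 4 bytes further, does not.
bool misaligned_next()
{
  aligned_buffer b{};
  return is_aligned_to(b.data + 1, sizeof(float) * 4);
}
```

A debug-mode assertion of the kind mentioned above would typically wrap a predicate like this one around every aligned load and store.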


Examples 

load<pack<T, N>>(p, i) loads a pack at the aligned address p + i

[Figure: main memory cells 0D to 18; load<pack<float>>(0x10, 0) fills a register with cells 10, 11, 12 and 13.]
Examples 

load<pack<T, N>, Offset>(p, i) loads a pack at address p + i + Offset; p + i must be aligned.

[Figure: main memory; load<pack<float>, 2>(0x10, 0)]


Expression Template to the rescue!
Rationale

■ Most SIMD ISAs have fused operations (FMA, etc.)

■ We want to write simple code yet still get the best performance out of them

■ We need lazy evaluation: Boost.Proto to the rescue!


Advantage

■ All expressions, even those involving functions, generate template expressions that are evaluated on assignment or in the conversion operator

■ a * b + c is mapped to fma(a, b, c)
a + b * c is mapped to fma(b, c, a)
!(a < b) is mapped to is_nlt(a, b)

■ The optimization system is open for extensions 
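
The mapping of a * b + c to a fused operation can be illustrated without Boost.Proto. The sketch below is a hand-rolled, scalar-only approximation: the types `value` and `product` are hypothetical, and `std::fma` stands in for a SIMD fused multiply-add. It shows the mechanism (a lazy node that is fused or evaluated later), not the library's actual machinery.

```cpp
#include <cassert>
#include <cmath>

// Illustrative scalar wrapper standing in for a pack.
struct value
{
  float v;
};

// Lazy node: remembers the operands of a * b without multiplying yet.
struct product
{
  float lhs, rhs;

  // Evaluated on conversion when the product is used on its own.
  operator value() const { return value{lhs * rhs}; }
};

// a * b builds the lazy node instead of computing a result.
inline product operator*(value a, value b) { return product{a.v, b.v}; }

// a * b + c: the pending product is fused with the addition.
inline value operator+(product p, value c)
{
  return value{std::fma(p.lhs, p.rhs, c.v)};
}

// a + b * c: same fusion with the operands reordered.
inline value operator+(value a, product p)
{
  return value{std::fma(p.lhs, p.rhs, a.v)};
}

// Plain addition when no product is involved.
inline value operator+(value a, value b) { return value{a.v + b.v}; }
```

The user still writes `a * b + c`; overload resolution picks the fused form because the left operand is a `product` node rather than an evaluated `value`.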
Expression Template to the rescue!
Generative programming idioms used



Extra arithmetic, bitwise and IEEE operations, predicates

Arithmetic

■ saturated arithmetic
■ float/int conversion
■ round, floor, ceil, trunc
■ sqrt, hypot
■ average
■ random
■ min/max
■ rounded division and remainder

Bitwise

■ select
■ andnot, ornot
■ popcnt
■ ffs
■ ror, rol
■ rshr, rshl
■ twopower

IEEE

■ ilogb, frexp
■ ldexp
■ next/prev
■ ulpdist

Predicates

■ comparison with zero
■ negation of comparison
■ is_unord, is_nan, is_invalid
■ is_odd, is_even
■ majority
Reduction and SWAR operations
Reduction

■ any, all

■ nbtrue

■ minimum/maximum, posmin/posmax

■ sum 

■ product, dot product 


SWAR 

■ group/split 

■ splatted reduction 

■ cumsum 

■ sort 
Benchmarks for floats

Timing performance for float (in cycles/value)
Architecture: Core i7 SandyBridge, AVX

Function   Range          STD or Other   Scalar   SIMD
exp        [-10, 10]      46             38       7
log        [-10, -10]     42             37       5
asin       [-1, 1]        40             35       13
cos        [-20π, 20π]    66             47       6
fast_cos   [-π/4, π/4]    32             9        1.3
RGB to grayscale


Scalar version


float const *red, *green, *blue;
float* result;

for (std::size_t i = 0; i != height * width; ++i)
  result[i] = 0.3f * red[i] + 0.59f * green[i] + 0.11f * blue[i];


SIMD version


auto rgb = fusion::make_vector(red, green, blue);

// Each iteration processes pack<float>::static_size pixels.
for (std::size_t i = 0; i != height * width; i += pack<float>::static_size)
{
  auto p = load<decltype(rgb)>(rgb, i);
  auto res = 0.3f  * fusion::at_c<0>(p)
           + 0.59f * fusion::at_c<1>(p)
           + 0.11f * fusion::at_c<2>(p);
  store(res, result, i);
}
Easy enough, but what if...
■ ... I’ve got interleaved RGB or RGBA?


Interleaved RGB
What do we want to do?


Data in main memory 



Interleaved Iterators outputs 


Problem 

■ On current SIMD extensions, this requires a combination of unaligned load operations and vector shuffling.

■ But gather is coming: with AVX2, we will be able to gather non-adjacent data elements.


Proposal 

■ We need to have a proper abstraction for deinterleaving data. 
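
What such an abstraction has to produce can be shown with a scalar reference. This is a sketch only: `rgb_chunk` and `deinterleave3` are hypothetical names, and the loop stands in for the unaligned loads and shuffles (or gathers) a real implementation would use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One deinterleaved chunk: N consecutive pixels split into three channels.
template <std::size_t N>
struct rgb_chunk
{
  std::array<float, N> r, g, b;
};

// Hypothetical helper: read N interleaved RGB pixels starting at data
// (R0 G0 B0 R1 G1 B1 ...) and return one array per channel.
template <std::size_t N>
rgb_chunk<N> deinterleave3(float const* data)
{
  rgb_chunk<N> out{};
  for (std::size_t i = 0; i != N; ++i)
  {
    out.r[i] = data[3 * i + 0];
    out.g[i] = data[3 * i + 1];
    out.b[i] = data[3 * i + 2];
  }
  return out;
}
```

Each of the three arrays corresponds to one pack in the output of an interleaved iterator, so the grayscale formula can then be applied channel by channel.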
Interleaved RGB


Code example


float *data, *result; // Interleaved data starts at data.

simd::interleaved_iterator<float, 3> b(data);
simd::interleaved_iterator<float, 3> e(data + size);
std::transform(b, e, simd::begin(result),
               [](decltype(*b) v)
               {
                 return 0.3f  * fusion::at_c<0>(v)
                      + 0.59f * fusion::at_c<1>(v)
                      + 0.11f * fusion::at_c<2>(v);
               });
Talk Layout


Boost.SIMD and the Standard Library
Operations vs Data


Where/How to store our data?

■ SIMD operations require data to operate on

■ Usual approaches force a specific container type onto users

■ Not generic enough
A better approach 

■ SIMD compliant allocators 

■ SIMD Range and Iterators over ContiguousRange 

■ Adapt our SIMD classes to work with a subset of STD algorithms 
SIMD allocators
Rationale

■ Allow containers to handle memory in a SIMD compliant way

■ Handles alignment of memory

■ Handles padding of memory

Example

std::vector<float, simd::allocator<float>> v(173);
assert(simd::is_aligned(&v[0]));
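
For readers who want to see what an alignment-aware allocator involves, here is a minimal C++17 sketch. `aligned_allocator` is a hypothetical name, not `simd::allocator`; the 32-byte default is an assumption, and the sketch covers alignment only, not the padding mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal allocator returning memory aligned on Alignment bytes.
// Requires C++17 for the aligned forms of operator new/delete.
template <class T, std::size_t Alignment = 32>
struct aligned_allocator
{
  using value_type = T;

  aligned_allocator() = default;

  template <class U>
  aligned_allocator(aligned_allocator<U, Alignment> const&) noexcept {}

  // Needed because of the non-type template parameter.
  template <class U>
  struct rebind { using other = aligned_allocator<U, Alignment>; };

  T* allocate(std::size_t n)
  {
    return static_cast<T*>(
        ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }

  void deallocate(T* p, std::size_t) noexcept
  {
    ::operator delete(p, std::align_val_t(Alignment));
  }
};

template <class T, class U, std::size_t A>
bool operator==(aligned_allocator<T, A> const&, aligned_allocator<U, A> const&)
{
  return true;
}

template <class T, class U, std::size_t A>
bool operator!=(aligned_allocator<T, A> const&, aligned_allocator<U, A> const&)
{
  return false;
}

// Same shape as the slide's example: 173 floats, aligned start address.
bool vector_is_aligned()
{
  std::vector<float, aligned_allocator<float>> v(173);
  return reinterpret_cast<std::uintptr_t>(v.data()) % 32 == 0;
}
```

Because the allocator is a template argument of the container, user code keeps using `std::vector` as usual, which is the point of the approach.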
From Range to SIMDRange
Iterator interface

■ Boost.SIMD provides simd::begin()/simd::end()

■ Turn iterators into SIMD iterators returning pack

■ Take a regular range, iterate over it in SIMD
Example 


[Figure: memory cells 10 to 1B holding v, with simd::begin(v.begin()) and simd::end(v.end()) shown alongside v.begin() and v.end().]
Example 

std::vector<float, simd::allocator<float>> v(1024);
pack<float> x, z;

x = boost::accumulate(simd::range(v), z);
From Range to SIMDRange
Iterator interface

■ pack provides begin()/end()

■ Directly usable in STD algorithms

■ Directly usable in Boost.Range algorithms

Example

pack<float> x(1, 2, 3, 4);

float k = std::accumulate(x.begin(), x.end(), 0.f);
SIMD values as Range
Putting everything together

std::vector<float, simd::allocator<float>> v(1024);
pack<float> x, z;
float r;

x = boost::accumulate(simd::range(v), z);
r = std::accumulate(x.begin(), x.end(), 0.f);


The proper way

std::vector<float, simd::allocator<float>> v(1024);
pack<float> x, z;
float r;

r = simd::accumulate(simd::range(v), pack<float>());
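
The two-stage reduction above (pack-wise accumulation, then a sum over the lanes of the resulting pack) can be mimicked in portable C++. This is a sketch under assumptions: a pack width of 4, an input size that is a multiple of that width, and hypothetical function names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t lanes = 4;               // assumed pack width
using lane_sums = std::array<float, lanes>;    // stands in for pack<float>

// Stage 1: accumulate lane by lane, as a pack-wise accumulate would.
// Assumes v.size() is a multiple of `lanes`.
lane_sums accumulate_lanes(std::vector<float> const& v)
{
  lane_sums acc{};
  for (std::size_t i = 0; i + lanes <= v.size(); i += lanes)
    for (std::size_t l = 0; l != lanes; ++l)
      acc[l] += v[i + l];
  return acc;
}

// Stage 2: horizontal sum over the lanes of the partial result.
float accumulate_all(std::vector<float> const& v)
{
  lane_sums acc = accumulate_lanes(v);
  return std::accumulate(acc.begin(), acc.end(), 0.f);
}
```

Note that the additions happen in a different order than in a plain scalar loop, so for general floating-point data the result may differ from the scalar sum in the last bits.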
SIMD values as Range
std::accumulate speed-up for float
Architecture: Core2Duo Arrandale, SSE2.

[Figure: speed-up plot; horizontal axis ticks from 2 to 16384 in powers of two.]
Another example: SIMD Filtering
Unidimensional filtering


Sliding window

VR = 1/3 * ( load<-1>(vx) + load<0>(vx) + load<1>(vx) )

Another example: SIMD Filtering


Generic SIMD/scalar code

template <class RangeIn, class RangeOut>
inline void average(RangeOut result, RangeIn input)
{
  typedef typename RangeIn::iterator in_iterator;
  typedef typename RangeOut::iterator iterator;
  typedef typename iterator_value<iterator>::type type;

  iterator br = result.begin(), er = result.end();
  in_iterator data = input.begin();

  // Skip both ends: they lack a left or right neighbour.
  ++br; --er; ++data;

  while (br != er)
  {
    type xm1 = load<type, -1>(data, 0);
    type x   = load<type>(data, 0);
    type xp1 = load<type, +1>(data, 0);
    *br = 1.f / 3 * (xm1 + x + xp1);

    ++br; ++data;
  }
}
Another example: SIMD Filtering
Unidimensional filtering

■ Hmm... this pattern seems to recur throughout the image/signal processing world.


Proposal

■ Abstract the sliding window computation pattern with a shifted_iterator.

[Figure: data in main memory (cells 10 to 16) and the shifted iterator output for N=3: three packs starting at cells 10, 11 and 12.]
Another example: SIMD Filtering
Looks like this!


float *data, *result;

simd::shifted_iterator<float, 3> b(data);
simd::shifted_iterator<float, 3> e(data + size);
std::transform(b, e, simd::begin(result),
               [](decltype(*b) v)
               {
                 return 1.f / 3 * (v[0] + v[1] + v[2]);
               });
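
A scalar reference for the same 3-point mean filter is handy when validating a SIMD version. The sketch below is not from the talk: `mean3` is a hypothetical name, and copying the first and last samples unchanged is an assumed boundary policy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the 3-point sliding-window mean.
// Assumption: the first and last samples are copied unchanged,
// since they lack a left or right neighbour.
std::vector<float> mean3(std::vector<float> const& in)
{
  std::vector<float> out(in);
  for (std::size_t i = 1; i + 1 < in.size(); ++i)
    out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.f;
  return out;
}
```

Each interior output depends on three neighbouring inputs, which is exactly the triple a shifted iterator with N=3 hands to the lambda above.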
Talk Layout


Results 
RGB2YUV
Boost.SIMD vs Scalar C++ vs Handwritten SIMD code
Image: 256 x 256 float

[Figure: chart comparing Scalar C++, Reference SIMD and Boost.SIMD on SSE4.2, AVX and Altivec.]
Sigma Delta Motion detection Algorithm
Thomas Tri-diagonal Solver
Algorithm 1: Thomas algorithm: solve Ax = s.
input: A tridiagonal system.

• a[n]: sub-diagonal

• b[n]: main diagonal

• c[n]: super-diagonal

• s[n]: right-hand side

begin

  Forward elimination:
  for i = 2, ..., n do
    δ = a[i] c[i-1] / b[i-1]
    b[i] = b[i] - δ
    δ_s = a[i] s[i-1] / b[i-1]
    s[i] = s[i] - δ_s
  end

  Backward substitution:
  x[n] = s[n] / b[n]
  for i = n-1, ..., 1 do
    x[i] = (s[i] - c[i] x[i+1]) / b[i]
  end
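
The algorithm translates almost line for line into scalar C++. The version below is a sketch with 0-based indexing; the function name `thomas_solve` is hypothetical, and it assumes that no pivot b[i] becomes zero (no pivoting is performed).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar Thomas solver for a tridiagonal system Ax = s.
// a: sub-diagonal (a[0] unused), b: main diagonal,
// c: super-diagonal (c[n-1] unused), s: right-hand side.
// b and s are taken by value because the elimination overwrites them.
std::vector<double> thomas_solve(std::vector<double> const& a,
                                 std::vector<double> b,
                                 std::vector<double> const& c,
                                 std::vector<double> s)
{
  std::size_t const n = b.size();
  std::vector<double> x(n);
  if (n == 0) return x;

  // Forward elimination.
  for (std::size_t i = 1; i < n; ++i)
  {
    double const m = a[i] / b[i - 1];
    b[i] -= m * c[i - 1];
    s[i] -= m * s[i - 1];
  }

  // Backward substitution.
  x[n - 1] = s[n - 1] / b[n - 1];
  for (std::size_t i = n - 1; i-- > 0;)
    x[i] = (s[i] - c[i] * x[i + 1]) / b[i];

  return x;
}
```

Both loops carry a dependency from one row to the next, so a single system does not vectorize along i; SIMD versions usually solve several independent systems at once, one per lane.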


Thomas - Scalar Fortran Code vs SIMD version

[Figure: scalar Fortran code vs SIMD version, plotted against system size.]
Talk Layout


Conclusion 
Current Work
Integrated SIMD support

■ Most STD algorithms should be specialized to be run in one scope

■ Can we have a Boost.Range adaptor like simd(r)?

■ Support for shifted Range using load<T,N> and unaligned Range 


What’s on Fire ? 

■ ISA improvements for ARM, MIC 

■ Submission to Boost C++ Libraries... soon... 
Overview of Boost.SIMD
Our goals

■ Bring SIMD programming to a usable state

■ If we have Boost.Atomic, why not Boost.SIMD?

■ Be attractive by playing well with the rest of C++


What we achieved

■ Demonstrated some impact in terms of performance

■ Made using SIMD almost as simple as writing scalar code
Thanks for your attention


